Abstract

Regular path query languages for data graphs are essentially
untyped. The lack of type information greatly limits the optimization
opportunities for query engines and makes application development
more complex. In this paper we discuss a simple, yet expressive,
schema language for edge-labelled data graphs. This schema language
is then used to define a query type inference approach with
good precision properties.

Categories and Subject Descriptors H.2.1 [Database Management]:
Logical Design

General Terms Theory, Languages 

Keywords RPQs, type inference, data graphs 

1. Introduction 

In the last few years graph databases have gained more and more
relevance in application areas such as the Semantic Web, social
networks, bioinformatics, network traffic analysis, and crime
detection. This has led to the definition of many query formalisms for
graph databases, such as regular path queries (RPQs), nested regular
expressions (NREs), conjunctive regular path queries (CRPQs), GXPath,
and their derivatives. All these languages are based on the idea of
specifying regular expressions describing paths in the input graph,
and can be considered, to some extent, a generalization of existing
path query languages for semistructured data (see XPath, for
instance). Regular path query languages are often used in other graph
query languages, like Cypher or PQL, to specify patterns in variable
binding clauses.

Regular path query languages are essentially untyped. This
means that one cannot statically infer the structure of query
results (type inference), check if the results satisfy a given schema
(type-checking), or verify if the query results would always be
empty (query correctness). Furthermore, the lack of type information
greatly limits the optimization opportunities for query engines
and makes application development more complex.

Our Contribution In this paper we describe a simple, yet expressive,
schema language for edge-labelled data graphs. A schema is
formed by a collection of schema elements, each one describing
the set of incoming and outgoing edges of a class of graph nodes;
edges are specified through regular expressions. Other schema
languages for graphs allow the designer to describe in full detail the
structure of outgoing edges as well as the structure of node values,
but give her very limited modelling choices for incoming edges. Our
schema language, instead, makes no distinction between incoming and
outgoing edges, and gives the designer the same modelling tools for
both classes of edges.

This increased expressive power has the drawback that, as we
will show in Section 2, in the general case schema emptiness
checking is undecidable; hence, a few restrictions on the regular
expressions describing edges are needed in order to ensure that the
semantics of graph schemas is well defined, and to make schema
emptiness decidable. The resulting class of schemas is named
well-formed schemas and can be viewed as a generalization of DTDs
to data graphs. The proposed language is a first step towards
the definition and analysis of even more powerful schema languages
for data graphs.

In the second part of the paper we leverage well-formed
schemas to build a type inference system, working in polynomial
time, for RPQs, NREs, and GXPath queries with good soundness
and completeness properties; in particular, this type inference
system is sound and complete on RPQs, while completeness has to be
relaxed on NREs and GXPath queries. This means that, by using
this system, it is possible to decide in polynomial time whether an
RPQ is satisfiable on graphs conforming to a given schema.

Paper Outline The paper is structured as follows. In Section 2
we first describe the data model and the type language used in our
approach; then, we present a schema language for data graphs and
discuss the emptiness problem for the resulting schemas. In
Section 3 we survey regular path query languages and describe
their semantics. In Section 4 we present our type inference
systems. In Sections 5 and 6 we discuss related work and
draw our conclusions.

2. Preliminary Definitions 

2.1 Data Model and Type Language 

Following previous work, we model a data graph as an edge-labelled
graph, as shown below.

Definition 2.1 (Data Graph) Given a finite alphabet $\Sigma$ and a
(possibly) infinite value domain $\mathcal{D}$, a data graph $G$ over
$\Sigma$ and $\mathcal{D}$ is a triple $G = (V, E, \rho)$, where:

• $V$ is a finite set of nodes;

• $E \subseteq V \times \Sigma \times V$ is a set of labelled, directed edges $(v_i, a, v_j)$;

• $\rho : V \to \mathcal{D}$ is a mapping from nodes to values.

Given a node $v$, we indicate with $in(v)$ and $out(v)$ the sets of
incoming and outgoing edges, respectively. Formally:

• $in(v) = \{(v', a, v) \in E \mid v' \in V \wedge a \in \Sigma\}$;

• $out(v) = \{(v, a, v') \in E \mid v' \in V \wedge a \in \Sigma\}$.

We assume that sequences of outgoing (incoming) edges of a
node are unordered, as is often the case in graph databases. Given
a set of edges $S_E \subseteq E$, we will indicate with $\pi(S_E)$ the
unordered concatenation of the labels of the edges in $S_E$.
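
As an illustration, Definition 2.1 can be sketched in Python. The class name DataGraph, the helper labels, and the sample nodes are assumptions of this sketch and not part of the formal model; the multiset returned by labels plays the role of the unordered concatenation $\pi(S_E)$.

```python
from collections import Counter


class DataGraph:
    """Edge-labelled data graph G = (V, E, rho); names are illustrative only."""

    def __init__(self, values, edges):
        self.values = dict(values)      # rho: node -> value
        self.nodes = set(self.values)   # V
        self.edges = set(edges)         # E: (source, label, target) triples
        for (u, _, v) in self.edges:
            if u not in self.nodes or v not in self.nodes:
                raise ValueError("edge endpoint is not a node")

    def incoming(self, v):
        """in(v): edges whose target is v."""
        return {e for e in self.edges if e[2] == v}

    def outgoing(self, v):
        """out(v): edges whose source is v."""
        return {e for e in self.edges if e[0] == v}


def labels(edge_set):
    """pi(S_E): labels of a set of edges as a multiset (order is irrelevant)."""
    return Counter(label for (_, label, _) in edge_set)


# A tiny hypothetical bibliographic graph.
g = DataGraph(
    values={"p1": "paper", "j1": "journal", "a1": "author", "a2": "author"},
    edges=[("p1", "journal", "j1"), ("p1", "creator", "a1"), ("p1", "creator", "a2")],
)
print(labels(g.outgoing("p1")))
print(labels(g.incoming("a1")))
```

Representing $\pi$ as a multiset rather than a string makes the commutativity of unordered concatenation explicit.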

This data model is general enough to capture many practical use 
graphs, ranging from RDF data to social network graphs, as shown 
by the following example. 

Example 2.2 Consider the graph shown in Figure 1. This graph
contains bibliographic information coming from a fragment of the
RDF representation of the DBLP repository. We indicate the value of
a node inside its graphical representation, and use RDF properties
to label edges. ■


In this work we propose a schema language for data graphs that
associates to each schema element a pair of regular expressions
describing sequences of labels of the incoming and outgoing edges
of each node. Regular expressions obey the following grammar:

$T ::= \epsilon \mid a \mid T + T \mid T \cdot T \mid T^*$

where $\epsilon$ denotes the empty sequence, $a$ is a symbol in
$\Sigma$, $+$ and $\cdot$ denote, respectively, union and unordered
concatenation, and $*$ is the Kleene star. As expected, unordered
concatenation $\cdot$ is commutative, associative and has $\epsilon$
as neutral element. In particular, the expression
$T_1 \cdot \ldots \cdot T_n$ is equivalent to all of its possible
permutations. In the following we will also use $T^+$ and $T?$ as
abbreviations for $T^* \cdot T$ and $T + \epsilon$.

The semantics of regular expressions is denoted as $L(-)$, the
minimal function satisfying the following equations:

$L(\epsilon) = \{\epsilon\}$
$L(a) = \{a\}$
$L(T_1 + T_2) = L(T_1) \cup L(T_2)$
$L(T_1 \cdot T_2) = L(T_1) \cdot L(T_2)$
$L(T^*) = \bigcup_{i \geq 0} L(T)^i$

where $L_1 \cdot L_2$ denotes unordered language concatenation and is
defined in the obvious way, while for any $i \in \mathbb{N}$,
$L^i = L \cdot L^{i-1}$, with $L^0 = \{\epsilon\}$.


The semantics of a schema element $e$ is defined as follows.

$\llbracket e \rrbracket = \{v \mid \pi(in(v)) \in L(e.in) \wedge \pi(out(v)) \in L(e.out)\}$

A schema element, then, specifies constraints on the incoming and 
outgoing edges of a node. Consider, for instance, the following 
schema element: 


$e = (a \cdot b \cdot (c + d),\ e \cdot h^*)$

This element describes graph nodes having an incoming $a$-edge,
an incoming $b$-edge, as well as an incoming edge labelled with $c$ or
$d$; these nodes must also have an outgoing $e$-edge together with
zero or more outgoing $h$-edges.
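
The membership test behind this semantics can be sketched in Python as follows. This is a brute-force illustration under stated assumptions: expressions are encoded as nested tuples (a hypothetical encoding), multisets of labels are Counters, and the functions sub_bags and matches are names introduced here. It enumerates sub-multisets and is therefore exponential in the worst case.

```python
from collections import Counter
from itertools import product

# Hypothetical encoding of expressions as nested tuples:
# ("eps",), ("sym", a), ("alt", T1, T2), ("cat", T1, T2), ("star", T)


def sub_bags(bag):
    """Yield every sub-multiset of a Counter."""
    keys = sorted(bag)
    for counts in product(*(range(bag[k] + 1) for k in keys)):
        yield Counter({k: c for k, c in zip(keys, counts) if c > 0})


def matches(expr, bag):
    """Brute-force test of bag in L(expr) under unordered concatenation."""
    kind = expr[0]
    if kind == "eps":
        return not bag
    if kind == "sym":
        return bag == Counter([expr[1]])
    if kind == "alt":
        return matches(expr[1], bag) or matches(expr[2], bag)
    if kind == "cat":
        # Split the bag in every possible way between the two operands.
        return any(matches(expr[1], s) and matches(expr[2], bag - s)
                   for s in sub_bags(bag))
    if kind == "star":
        if not bag:
            return True
        # Peel off a non-empty part matching the body, then recurse on the rest.
        return any(s and matches(expr[1], s) and matches(expr, bag - s)
                   for s in sub_bags(bag))
    raise ValueError(kind)


# The schema element e = (a . b . (c + d), e . h*) discussed above.
in_expr = ("cat", ("cat", ("sym", "a"), ("sym", "b")),
           ("alt", ("sym", "c"), ("sym", "d")))
out_expr = ("cat", ("sym", "e"), ("star", ("sym", "h")))

print(matches(in_expr, Counter("abc")), matches(out_expr, Counter("ehh")))
```

A node belongs to the semantics of the element when both its incoming and its outgoing label multisets pass this test.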

In our schema language, hence, we impose constraints not only
on outgoing edges, but also on incoming edges. This is in contrast
to what happens in schema languages for XML data (e.g., DTDs
and XML Schema). This choice is motivated by the observation
that in a graph each vertex may have multiple incoming edges and,
hence, multiple fathers, while in an XML tree each node, except for
the root, has a single father. Therefore, it is important to give
the schema designer the ability to model the set of incoming edges,
so as to avoid potentially dangerous situations. Consider, for
instance, a data graph describing a bibliographic database,
where nodes can represent books, papers, authors, and publishers;
of course, while author nodes can have incoming edges labelled
with “writtenBy”, they cannot allow for incoming edges with label
“publishedBy”, which, instead, are allowed for publisher nodes
only.

The use of regular expressions for modelling incoming edges
makes our language quite different from existing graph schema
languages like TSL and SheX. In all these languages,
the designer can use regular expressions to specify the sequence
of outgoing edges for each node type; each edge is described by
a label and by the type of the receiving node. Therefore, in these
languages it is not possible to specify, for instance, that a node of a
given type can have exactly one incoming edge of a given kind.

Definition 2.5 (Graph Schemas) A graph schema $S$ is a finite set
of schema elements $\{e_i\}_{i=0}^{n}$ such that:

1. $\forall i \in [0..n].\ \forall l \in sym(e_i.in).\ \exists j \in [0..n].\ l \in sym(e_j.out)$;

2. $\forall i \in [0..n].\ \forall l \in sym(e_i.out).\ \exists j \in [0..n].\ l \in sym(e_j.in)$;

3. $\forall i, j \in [0..n] : (e_i.in \cap e_j.in = \emptyset \vee e_i.out \cap e_j.out = \emptyset)$.

Conditions 1 and 2 above are necessary to ensure that the 
schema cannot define graphs with dangling edges: any symbol 
used in an outgoing edge must also be used to label an incoming 
edge, and vice versa. As we will see later, these conditions are 
not sufficient to imply non-emptiness. Condition 3 guarantees the 
uniqueness of node typing: a graph node can be typed by at most 
one schema element. 
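
Conditions 1 and 2 amount to requiring that the set of received symbols and the set of emitted symbols coincide. A minimal Python sketch of this check is given below; the tuple encoding of expressions and the function names are assumptions of the sketch, and condition 3 is deliberately left out, since it requires comparing the languages of the expressions.

```python
# Hypothetical encoding of expressions as nested tuples:
# ("eps",), ("sym", a), ("alt", T1, T2), ("cat", T1, T2), ("star", T)


def sym(expr):
    """Set of alphabet symbols appearing in an expression."""
    kind = expr[0]
    if kind == "eps":
        return set()
    if kind == "sym":
        return {expr[1]}
    if kind == "star":
        return sym(expr[1])
    return sym(expr[1]) | sym(expr[2])


def no_dangling_labels(schema):
    """Conditions 1 and 2: every received symbol is emitted, and vice versa."""
    received = set().union(*(sym(e_in) for e_in, _ in schema))
    emitted = set().union(*(sym(e_out) for _, e_out in schema))
    return received == emitted


a, b = ("sym", "a"), ("sym", "b")
balanced = [(("eps",), ("cat", a, b)), (("cat", a, b), ("eps",))]
unbalanced = [(a, b)]  # receives only a, emits only b
print(no_dangling_labels(balanced), no_dangling_labels(unbalanced))
```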

Schema semantics is defined as follows. 

Definition 2.6 (Graph Schema Semantics) A data graph
$G = (V, E, \rho)$ over $\Sigma$ and $\mathcal{D}$ is described by a
graph schema $S$ ($G \in \llbracket S \rrbracket$) if and only if for
each $v \in V$ there exists $e \in S$ such that
$v \in \llbracket e \rrbracket$.


2.2 Schema Language 

Regular expressions are the building blocks of our schema
language.

Definition 2.3 Given a regular expression $T$ over $\Sigma$, $sym(T)$
is the set of symbols in $\Sigma$ appearing in $T$.

Definition 2.4 (Graph Schema Element) Given a finite alphabet
$\Sigma$, a schema element $e$ over $\Sigma$ is a pair
$(e.in, e.out)$, where $e.in$ and $e.out$ are regular expressions over
$\Sigma$.


Example 2.7 Consider again the graph of Example 2.2. This graph
can be typed by the schema $S = \{e_1, e_2, e_3, e_4, e_5\}$, where:

$e_1 = (\epsilon,\ (journal + partOf) \cdot (creator)^+)$

$e_2 = (journal^*,\ \epsilon)$

$e_3 = (partOf^*,\ series)$

$e_4 = (series^*,\ \epsilon)$

$e_5 = (creator^*,\ \epsilon)$



















2.3 Schema Emptiness 

A graph schema, even though it satisfies all the properties of
Definition 2.5, may be empty, and it could be difficult for the user
to figure out whether the schema she has defined is empty. For a
simple schema like the following one, emptiness can be easily detected.

Example 2.8 Consider the graph schema $S = \{e_1, e_2\}$, where:

$e_1 = (\epsilon,\ a \cdot b \cdot c \cdot c)$

$e_2 = (a \cdot b \cdot c,\ \epsilon)$

This schema satisfies conditions 1-3 of Definition 2.5. However,
it is empty, as the cardinality constraints expressed by the regular
expressions of incoming and outgoing edges are incompatible. ■

For some schemas, checking compatibility between incoming 
and outgoing edges can be far from being obvious, as happens for 
the following one. 

Example 2.9 Consider the graph schema $S = \{e_1, e_2, e_3\}$, where:

$e_1 = (\epsilon,\ a \cdot b \cdot c \cdot c \cdot c \cdot c)$

$e_2 = (a \cdot b \cdot c,\ \epsilon)$

$e_3 = (c \cdot c,\ \epsilon)$

In this schema each $e_1$ node produces 4 outgoing $c$-edges, which
are consumed by $e_2$ and $e_3$ nodes. This schema is not empty, as it
is possible to build a well-formed graph comprising 2 $e_1$ nodes,
2 $e_2$ nodes, and 3 $e_3$ nodes. ■

Without imposing restrictions on the class of regular expressions
being used, checking the emptiness of a schema is not decidable.
To show this undecidability result, it is necessary to establish
an equivalence between graph schemas and homogeneous systems
of linear Diophantine equations with parameters. Indeed, we associate
to each schema element a distinct variable, and build, for each
symbol, a polynomial equation describing the produced and consumed
edges labelled with that symbol. Each symbol equation contains
the variables of the schema elements producing or consuming
edges labelled with that symbol; the coefficient of each variable
describes the number of produced or consumed edges. The result is a
homogeneous system which has a non-zero natural solution if and
only if the schema is not empty. The following example illustrates
this approach.

Example 2.10 Consider again the schema of Example 2.8. This
empty schema consists of two schema elements ($e_1$ and $e_2$) to
which we can associate variables $x$ and $y$. Regular expressions in
the schema use three different symbols ($a$, $b$, and $c$), so we have
to define the following three linear equations:

a. $x - y = 0$

b. $x - y = 0$

c. $2x - y = 0$

In the first equation variable $x$ has coefficient 1, as $e_1$ produces
an $a$-edge, while variable $y$ has coefficient $-1$ since $e_2$
consumes an $a$-edge. As can be easily seen, the only solution of this
system is $(0, 0)$.

Consider now the schema of Example 2.9. As illustrated before,
this schema is not empty and comprises three schema elements
($e_1$, $e_2$, and $e_3$) to which we can associate variables $x$, $y$,
and $z$. As in the previous example, we have three distinct symbols in
the schema, so we can define a system with the following linear
equations:

a. $x - y = 0$

b. $x - y = 0$

c. $4x - y - 2z = 0$

It is easy to see that $(2, 2, 3)$ is a solution for this system. This
means that it is possible to build a graph with 2 $e_1$ vertices,
2 $e_2$ vertices, and 3 $e_3$ vertices. ■
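
The construction of Example 2.10 can be replayed with a short Python sketch. It only covers schemas whose expressions are plain concatenations of symbols (no unions or stars), each encoded as a multiset of labels; the function names and the bounded search are assumptions of this sketch, not a decision procedure.

```python
from collections import Counter
from itertools import product


def balance(schema, counts):
    """Per symbol: produced minus consumed edges, with counts[i] nodes of type i."""
    total = Counter()
    symbols = set()
    for (incoming, outgoing), k in zip(schema, counts):
        symbols |= set(incoming) | set(outgoing)
        for s, m in outgoing.items():
            total[s] += k * m   # edges emitted by nodes of this type
        for s, m in incoming.items():
            total[s] -= k * m   # edges received by nodes of this type
    return {s: total[s] for s in symbols}


def find_solution(schema, bound):
    """Search for a non-zero natural solution with every variable <= bound."""
    for counts in product(range(bound + 1), repeat=len(schema)):
        if any(counts) and all(v == 0 for v in balance(schema, counts).values()):
            return counts
    return None


# Each element is (incoming labels, outgoing labels).
ex_2_8 = [(Counter(), Counter("abcc")), (Counter("abc"), Counter())]
ex_2_9 = [(Counter(), Counter("abcccc")), (Counter("abc"), Counter()),
          (Counter("cc"), Counter())]

print(find_solution(ex_2_8, 5))
print(find_solution(ex_2_9, 3))
```

The bounded search only illustrates the correspondence between solutions and node counts; a missing solution below the bound does not by itself prove emptiness.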

When a schema contains Kleene stars, it is possible to build
an equivalent Diophantine system by introducing natural parameters,
as shown in the following example.

Example 2.11 Consider the graph schema $S = \{e_1, e_2, e_3\}$,
where:

$e_1 = (\epsilon,\ a \cdot b \cdot (c \cdot c \cdot c \cdot c)^*)$

$e_2 = ((a \cdot b \cdot c)^*,\ \epsilon)$

$e_3 = (c \cdot c,\ \epsilon)$

To build an equivalent system we can associate a distinct
parameter to each occurrence of the Kleene star; in particular, we
associate the parameter $h_1$ to the occurrence in $e_1$, and a
parameter $h_2$ to the occurrence in $e_2$. The resulting system is
the following:

a. $x - h_2 y = 0$

b. $x - h_2 y = 0$

c. $4 h_1 x - h_2 y - 2z = 0$

While this system contains equations that are linear in the variables
$x$, $y$, and $z$, coefficients are no longer constant and can assume
any value in $\mathbb{N}$. ■

In the case of the schemas of Example 2.10 it is quite easy
to verify if the corresponding system is consistent and has a
non-trivial, positive integer solution. Indeed, as pointed out in the
literature, it suffices to build the convex hull of the set of
$m$-dimensional points defined by the columns of the system
coefficient matrix and to check if 0 is contained in this polytope.

However, in the case of the schema of Example 2.11 this approach
can no longer be used. Indeed, the coefficient matrix contains
parameters that prevent one from computing the convex hull.
The problem of the consistency of homogeneous systems of
Diophantine equations with parameters has already been studied.
Xie et al. proved that, even if we restrict to linear polynomials
in the parameters (no nested Kleene stars), there exists a fixed
$k > 2$ such that the problem is undecidable if the system contains at
least $k$ equations (i.e., the schema uses at least $k$ distinct
symbols), and that the problem is decidable for systems of 2
equations; Clauss showed that the problem is decidable if the system
contains a single parameter, two variables, and any number of
equations.

These results motivate the need for a restriction on the class of
schemas that ensures the non-emptiness of the schema. To develop
such a restriction, we propose here an approach based on several
ingredients. The first one consists of restricting the kind of regular
expressions that can be used in element types. As seen before,
one source of difficulty is the presence of regular expressions with
multiple occurrences of a symbol. Another aspect that complicates
the problem is the nesting of repetitions: indeed, it is known that the
consistency of systems of Diophantine equations is undecidable if
the equation degree is greater than or equal to 4, and nested Kleene
stars in a schema just increase the degree of the equations in
the corresponding system. Consider, for instance, the following
schema:

$e_1 = (\epsilon,\ (a \cdot (b \cdot (c \cdot c \cdot c \cdot c)^*)^*)^*)$

$e_2 = ((a \cdot b \cdot c)^*,\ \epsilon)$

$e_3 = (c \cdot c,\ \epsilon)$

The corresponding system, which uses four parameters $h_1$, $h_2$,
$h_3$, and $h_4$, has degree 4, as shown below:

a. $h_1 x - h_2 y = 0$

b. $h_1 h_3 x - h_2 y = 0$

c. $4 h_1 h_3 h_4 x - h_2 y - 2z = 0$






Inspired by our previous works, we restrict here to
conflict-free (CF) regular expressions, that is, expressions where
i) any symbol may occur at most once (single-occurrence
constraint), and ii) repetition ($*$ or $+$) is only allowed over
symbols. By using conflict-free expressions only, we can avoid the
issues related to the nesting of repetitions as well as those
concerning multiple occurrences of the same symbol.

Conflict-free expressions obey the following grammar:

$T ::= \epsilon \mid a \mid T + T \mid T \cdot T \mid a^* \mid a^+$

and satisfy the single-occurrence constraint: for any $T_1 + T_2$ or
$T_1 \cdot T_2$ subexpression of a CF type,
$sym(T_1) \cap sym(T_2) = \emptyset$ holds.

The expression $a^* \cdot b + c$ is conflict-free, while the expression
used in Example 2.9 in schema element $e_1$ is not, as the
single-occurrence constraint is not respected there; the expression
$(a \cdot b)^* \cdot c$ is another example of a non-conflict-free
expression: single-occurrence is met, but the restriction over
repetitions is not.
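
The two conditions defining conflict-freedom are purely syntactic and can be checked by a simple recursive traversal. The following Python sketch assumes a tuple encoding of expressions (with an extra "plus" node for $a^+$); the encoding and the function names are illustrative assumptions.

```python
# Hypothetical encoding: ("eps",), ("sym", a), ("alt", T1, T2),
# ("cat", T1, T2), ("star", T), ("plus", T)


def sym(expr):
    """Set of alphabet symbols appearing in an expression."""
    kind = expr[0]
    if kind == "eps":
        return set()
    if kind == "sym":
        return {expr[1]}
    if kind in ("star", "plus"):
        return sym(expr[1])
    return sym(expr[1]) | sym(expr[2])


def is_conflict_free(expr):
    """Check single occurrence and repetition-over-symbols-only."""
    kind = expr[0]
    if kind in ("eps", "sym"):
        return True
    if kind in ("star", "plus"):
        return expr[1][0] == "sym"   # repetition only over a single symbol
    left, right = expr[1], expr[2]
    return (is_conflict_free(left) and is_conflict_free(right)
            and not (sym(left) & sym(right)))


a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
cf_expr = ("alt", ("cat", ("star", a), b), c)    # a* . b + c
nested_rep = ("cat", ("star", ("cat", a, b)), c)  # (a . b)* . c
repeated_sym = ("cat", ("cat", a, b), ("cat", c, c))  # a . b . c . c

print(is_conflict_free(cf_expr), is_conflict_free(nested_rep),
      is_conflict_free(repeated_sym))
```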

Existing studies have shown that users tend to define CF
expressions when creating schemas for XML data. We believe
that the same will hold in the context of data graphs, as the reasons
that lead users to adopt CF expressions depend on aspects that are
orthogonal to the particular data model at hand: conflict-free
expressions, indeed, have a semantics that is relatively simple for
humans to understand, and, at the same time, they allow one to
describe and constrain a wide class of sequences that arise in the
context of semi-structured data management.

The following example shows that, unfortunately,
conflict-freedom together with the properties that characterise
schemas (Definition 2.5) are not sufficient to ensure non-emptiness
of graph schemas.

Example 2.12 Consider the simple graph schema $S = \{e_1\}$,
where:

$e_1 = (a + b,\ a \cdot b)$

In this schema each $e_1$ node produces both an outgoing $a$-edge and
an outgoing $b$-edge. The only nodes that can receive these edges are
in turn of type $e_1$. These nodes, however, can receive either an
$a$-edge or a $b$-edge, and in turn emit two further $a$ and $b$
edges. This implies that no finite graph meets this schema. ■

An alternative, and equivalent, formulation of the above schema
is the following one, obtained by distributing element types over the
union type in the $a + b$ expression.

$e_1 = (a,\ a \cdot b)$

$e_2 = (b,\ a \cdot b)$

This formulation better highlights that, indeed, there are two
kinds of nodes that can be generated by schema $S$: the first one
is for nodes receiving an $a$-edge, and the second one is for nodes
receiving a $b$-edge. Now, since both $a$ and $b$ are emitted by both
kinds, we could ensure non-emptiness by modifying the schema as
follows:

$e_1 = (a^*,\ a \cdot b)$

$e_2 = (b^*,\ a \cdot b)$

It is easy to verify that this schema is not empty and that
infinitely many graphs conform to it. The idea underlying this
modification is that, whenever a symbol $a$ is emitted by multiple
schema elements in a schema, then each occurrence of $a$ that
appears in a receiving expression occurs under a $*$. This implies
that there must exist a schema element whose vertices can accept
as many $a$-edges as needed.

As we will see, the generalisation and formalisation of the above
sketched restriction actually ensures non-emptiness. Before switching
to the formal treatment, it is worth stressing that this restriction
demands that, whenever a symbol $b$ is emitted by multiple nodes
$n_1, \ldots, n_k$ with different types, any node $n$ receiving at
least one $b$-edge is allowed by its type $e_n$ to have multiple
incoming $b$-edges, thus allowing $n$ to be shared by
$n_1, \ldots, n_k$ via multiple $b$-edges.

Note that, of course, a similar restriction is needed for
received symbols with respect to emitted symbols: whenever a symbol
$a$ is received by multiple schema elements in a schema, then each
occurrence of $a$ that appears in an emitting expression occurs under
a $*$. This rules out empty schemas like the one including the
following node types (note that this schema is obtained from a
previous one by simply swapping the in and out expressions).

$e_1 = (a \cdot b,\ a)$

$e_2 = (a \cdot b,\ b)$

We identified this restriction after several attempts with
other restrictions. While trying to prove non-emptiness for those
restrictions, the main problem we had was that a constructive approach
(based on trying to build a graph valid with respect to the schema in
an incremental way) failed: each time a node of a given type
was introduced, it could receive (emit) pending edges emitted
(received) by other nodes already created in the process, but, at
the same time, the new node introduced other constraints (pending
outgoing and incoming edges) that existing nodes could not satisfy.
This, in turn, triggered the introduction of another node which
re-created the same situation, therefore leading to a circular and
possibly non-terminating process.

We have identified the above depicted restriction in such a way 
that a terminating constructive approach can be used in the proof of 
non-emptiness. While the restriction we adopt may seem artificial, 
we believe that it does not limit the modelling opportunities for 
the schema designer and that it can be safely adopted in automatic 
schema-learning approaches. Importantly, our restriction does not 
exclude schemas describing graphs where some nodes can receive 
at most one edge with a given label, as illustrated by the following 
example. 

Example 2.13 Consider graphs representing social networks
where users publish posts, which are commented on and/or liked by
other users, who in turn can establish friendship relationships
with other users.* A well-formed schema for this database is
$S = \{e_{user}, e_{post}\}$, where:

$e_{user} = (friend^*,\ friend^* \cdot post^* \cdot commented^*)$

$e_{post} = (post \cdot commented^* \cdot liked^*,\ \epsilon)$

Of course, a user can publish several posts, while a post is posted
by only one user. ■

In order to formalize the above illustrated restriction, we first
have to normalize regular expressions. Regular expressions must
be transformed into Disjunctive Normal Form (DNF), and then the
whole schema must be normalized (as illustrated before) in order
to distribute unions over type definitions. The following example
illustrates why normalisation is necessary for proving
non-emptiness.

Example 2.14 Consider the simple graph schema $S = \{e_1\}$,
where:

$e_1 = (a \cdot b \cdot c,\ a \cdot (b + c))$

It can be easily proved (by contradiction) that this schema is
empty. However, in its current formulation this schema satisfies the
restriction sketched above, as each symbol occurs once in every
incoming/outgoing regular expression and no repetition is used,
hence contradicting our previous claim.

* (Footnote to Example 2.13.) This example is borrowed from the official neo4j web site (blog section).

This is due to the fact that, as in Example 2.12, the current
formulation hides that the schema actually defines two kinds of
nodes. In order to exhibit this property, the outgoing regular
expression must be normalised, thus obtaining the following schema:

$e_1 = (a \cdot b \cdot c,\ (a \cdot b) + (a \cdot c))$

Furthermore, the whole schema must be transformed in order
to distribute element type definitions over the union that emerged by
means of normalisation, thus obtaining:

$e_1 = (a \cdot b \cdot c,\ a \cdot b)$

$e_2 = (a \cdot b \cdot c,\ a \cdot c)$

As can be observed, this schema formulation does not satisfy our
restriction. ■

Definition 2.15 (Disjunctive normal form) A regular expression
$T$ is in Disjunctive Normal Form (DNF) if it obeys the following
grammar.

$T ::= C + \ldots + C$

$C ::= \epsilon \mid a \mid C \cdot C \mid a^* \mid a^+$

Any regular expression can be transformed into DNF by using
the function defined below, where $A_i$ and $B_j$ denote union-free
regular expressions, and $\bigcup_{i=1}^{n} T_i$ denotes
$T_1 + \ldots + T_n$.

Definition 2.16 ($norm(\cdot)$)

(1) $norm(\epsilon) = \epsilon$

(2) $norm(a) = a$

(3) $norm(T_1 + T_2) = norm(T_1) + norm(T_2)$

(4) $norm(T_1 \cdot T_2) = norm(\bigcup_{i=1}^{n} \bigcup_{j=1}^{m} A_i \cdot B_j)$,
    where $norm(T_1) = \bigcup_{i=1}^{n} A_i$ and $norm(T_2) = \bigcup_{j=1}^{m} B_j$

(5) $norm(a^*) = a^*$

(6) $norm(a^+) = a^+$
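
A possible Python rendering of the normalisation function is sketched below, for conflict-free input only. A DNF is represented as a list of clauses, each clause being a tuple of atoms such as "a", "a*" or "a+" (the empty tuple stands for $\epsilon$); this representation, the tuple encoding of expressions and the function name are assumptions of the sketch. Since the products of union-free clauses are again union-free, the sketch does not re-apply the normalisation to the result of rule (4).

```python
# Hypothetical encoding: ("eps",), ("sym", a), ("alt", T1, T2),
# ("cat", T1, T2), ("star", ("sym", a)), ("plus", ("sym", a))


def norm(expr):
    """Return the DNF of a conflict-free expression as a list of clauses."""
    kind = expr[0]
    if kind == "eps":
        return [()]                                  # rule (1)
    if kind == "sym":
        return [(expr[1],)]                          # rule (2)
    if kind == "alt":
        return norm(expr[1]) + norm(expr[2])         # rule (3)
    if kind == "cat":
        # rule (4): distribute concatenation over the unions of both operands
        return [a + b for a in norm(expr[1]) for b in norm(expr[2])]
    if kind in ("star", "plus"):
        suffix = "*" if kind == "star" else "+"
        return [(expr[1][1] + suffix,)]              # rules (5) and (6)
    raise ValueError(kind)


a, b, c, d = (("sym", s) for s in "abcd")
print(norm(("cat", a, ("alt", b, c))))               # a . (b + c)
print(norm(("cat", ("alt", a, b), ("alt", c, d))))   # (a + b) . (c + d)
```

Each clause of the result corresponds to one union-free disjunct of the normalised expression.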


It is easy to prove that $T$ and $norm(T)$ are equivalent for any $T$.
To prove that $norm(T)$ actually transforms any regular expression
into disjunctive normal form, we need a preliminary lemma.

Lemma 2.17 Given a CF regular expression $T$, if $T$ contains a
single union, then $norm(T)$ is in DNF.

Proof. We prove the thesis by induction on the level of the parse
tree of $T$ where the union is located. Assume that the parse tree of
$T$ contains $n + 1$ levels, where 0 is the level of the root.

Base Assume that the union is located at the root of the parse tree
(level 0). Then $T = T_1 + T_2$, where $T_1$ and $T_2$ are union-free.
Hence, $T$ is already in DNF and
$norm(T) = norm(T_1) + norm(T_2) = T_1 + T_2$ is in DNF.

Inductive step Assume that the thesis is true for any regular
expression containing a single union $T_1 + T_2$ at level $i$ and
assume that $T$ contains a single union at level $i + 1$. Then,
if we indicate with $T'$ the subterm at level $i$ surrounding
$T_1 + T_2$, $T' = (T_1 + T_2) \cdot T_3$, where $T_1$, $T_2$, and
$T_3$ are union-free. In this case, by applying rule (4) of Definition
2.16, $norm(T') = norm(T'')$, where
$T'' = (norm(T_1) \cdot norm(T_3)) + (norm(T_2) \cdot norm(T_3))$.
In $T''$ the union is at level $i$, hence, by induction, $norm(T'')$
is in DNF, which proves the thesis.


Lemma 2.18 Given a CF regular expression $T$, $norm(T)$ is in
DNF.

Proof. We prove the thesis by induction on the number of unions
inside $T$.

Base If $T$ contains a single union, the thesis is true by Lemma 2.17.

Inductive step We assume that the thesis is true for regular
expressions containing $n$ unions. Let $T$ be a regular expression
containing $n + 1$ unions. We proceed by induction on the level
of the topmost union.

If the topmost union is at level 0, then $T = T_1 + T_2$, where $T_1$
and $T_2$ contain at most $n$ unions; by induction, $norm(T_1)$ and
$norm(T_2)$ are in DNF. Therefore, $norm(T) = norm(T_1) + norm(T_2)$
is in DNF.

Assume now that the topmost union $T_1 + T_2$ is at level $i + 1$
and the thesis is true for level $i$. Then, if $T'$ is the subterm
at level $i$ containing $T_1 + T_2$, $T' = (T_1 + T_2) \cdot T_3$.
$T_1$, $T_2$, and $T_3$ contain at most $n$ unions; hence, by the outer
induction, $norm(T_1)$, $norm(T_2)$, and $norm(T_3)$ are in DNF.
If $T' = (T_1 + T_2) \cdot T_3$, then, by applying rule (4) of
Definition 2.16, $norm(T') = norm(T'')$, where
$T'' = (norm(T_1) \cdot norm(T_3)) + (norm(T_2) \cdot norm(T_3))$;
$norm(T')$, hence, lifts the union to level $i$; by inner induction,
we have the thesis.


Definition 2.19 ($DNorm(S)$) Given a schema $S$, we indicate with
$DNorm(S)$ its double normalisation, i.e., the schema obtained from
$S$ by first normalising each regular expression in the $e_i$'s in
$S$, and then by distributing element types over unions of normalised
expressions in the $e_i$'s.

Formally, $DNorm(S)$ is the set of element types $e_{ij} = (C_i, D_j)$
such that: there exists $e_k = (in, out)$ in $S$ with
$norm(in) = C_1 + \ldots + C_n$ and $norm(out) = D_1 + \ldots + D_m$,
and $i \in [1 \ldots n]$ and $j \in [1 \ldots m]$.

We can now introduce the class of well-formed schemas
corresponding to our restriction.

Definition 2.20 (Well-formed schemas) A schema $S$ is well-formed
if the following holds.

For any symbol $a$, if there exist $e_1 = (in_1, out_1)$ and
$e_2 = (in_2, out_2)$ in $DNorm(S)$ such that $a$ occurs in both
$out_1$ and $out_2$ ($in_1$ and $in_2$) then, for any $e = (in, out)$
in $DNorm(S)$, any occurrence of $a$ in $in$ ($out$, respectively)
must be under a $*$.
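
The well-formedness condition can be checked directly on a schema that is already in double normal form. The Python sketch below assumes such a schema is given as a list of (in clause, out clause) pairs, where each clause is a tuple of atoms "a", "a*" or "a+" over alphanumeric symbol names; this representation and the function names are assumptions of the sketch.

```python
def symbol(atom):
    """Strip a trailing repetition marker from an atom such as 'a*' or 'a+'."""
    return atom.rstrip("*+")


def is_well_formed(dnorm_schema):
    """Check Definition 2.20 on a schema already in double normal form."""
    ins = [c_in for c_in, _ in dnorm_schema]
    outs = [c_out for _, c_out in dnorm_schema]

    def check(sources, targets):
        # A symbol occurring in two or more 'sources' clauses must be
        # starred wherever it occurs in a 'targets' clause.
        all_symbols = {symbol(a) for clause in sources for a in clause}
        for s in all_symbols:
            n = sum(any(symbol(a) == s for a in clause) for clause in sources)
            if n >= 2:
                for clause in targets:
                    if any(symbol(a) == s and a != s + "*" for a in clause):
                        return False
        return True

    return check(outs, ins) and check(ins, outs)


split = [(("a",), ("a", "b")), (("b",), ("a", "b"))]        # (a, a.b), (b, a.b)
starred = [(("a*",), ("a", "b")), (("b*",), ("a", "b"))]    # (a*, a.b), (b*, a.b)
swapped = [(("a", "b"), ("a",)), (("a", "b"), ("b",))]      # (a.b, a), (a.b, b)

print(is_well_formed(split), is_well_formed(starred), is_well_formed(swapped))
```

The three sample schemas are the normalised schema of Example 2.12, its starred repair, and the schema obtained by swapping in and out expressions.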

Theorem 2.21 Every well-formed schema $S$ is non-empty.

Proof. We first observe that $S$ and $DNorm(S)$ are equivalent,
that (*) each type in $DNorm(S)$ contains only union-free and CF
in/out expressions, and that (**) $DNorm(S)$ respects properties
1 and 2 of Definition 2.5 (these properties can be easily
proved). We then prove that $DNorm(S)$ is not empty. To this end
we prove that any schema $S'$ satisfying (*) and (**) has a valid
graph $G$ having exactly one node for each element type in $S'$ and
that, for each node of $G$ having type $e = (in, out)$ in $S'$, there
exists an incoming/outgoing $a$-edge for each $a$ in in/out. Clearly
this property entails the desired one.

We proceed by induction on $|S'|$, i.e., the number of schema
elements in $S'$.

For the base case $|S'| = 1$ we can build a graph with only one
node with an incoming (outgoing) $a$-edge for each symbol $a$ in the
regular expressions of the only type $e_1 = (in, out)$ of $DNorm(S)$.
The fact that this graph is valid with respect to $DNorm(S)$ follows
from (*) and (**).

Let us now consider the case $|S'| = n + 1$ with $n \geq 1$. We
pick a type $e = (in, out)$ in $S'$ and build a schema $S'_{-e}$ by
dropping $e = (in, out)$ from $S'$ and by deleting, in the expressions
of the remaining types, every symbol that occurs only in
$e = (in, out)$.










The schema $S'_{-e}$ still satisfies properties (*) and (**), so by
induction we can assume that there exists a graph $G$ conforming to it
and that (***) each type in $S'_{-e}$ has exactly one corresponding
node in $G$ and that, for each node of $G$ having type
$e' = (in', out')$, there exists an incoming/outgoing $a$-edge for each
$a$ in $in'$/$out'$.

Now, we add a new node $n$ to $G$ as follows. The new node
$n$ will have an incoming (outgoing) $a$-edge for each symbol
$a$ in the regular expression $in$ ($out$) of the dropped type
$e = (in, out)$. In addition, we reactivate the erased symbols in the
types of $S'_{-e}$ and add a corresponding incoming/outgoing edge to
each node of a type $e' = (in', out')$ in $S'_{-e}$ for which the
symbol has been reactivated in $in'$/$out'$.

At this point, it may happen that some of the added edges
are dangling. We will show that we can connect these edges to
existing nodes (including $n$) thanks to the following facts.
Properties 1 and 2 of Definition 2.5 plus (***) ensure that for each
new dangling $a$-edge either i) there is an existing (not dangling)
$a$-edge $g$ connecting two nodes $n_1$ and $n_2$ of $G$ (this is the
case when the added edge is outgoing/incoming for $n$ and the edge
symbol is already used in types in $S'_{-e}$), or ii) the dangling
edge has a reactivated label $a$ (a label used in $e$ but not in
$S'_{-e}$) and actually multiple such pending $a$-edges can exist.

For the first case i) we can distinguish two sub-cases.

The first one is that the dangling $a$-edge is outgoing from $n$.
Recall that the types corresponding to $n$, $n_1$ and $n_2$ are
different types $e$, $e_1$, $e_2$. Without loss of generality, assume
that the existing $a$-edge is from $n_1$ to $n_2$. Before proceeding,
observe that at this point we have: $e_1 = (in_1, out_1)$ and $a$
occurs in $out_1$, $e = (in, out)$ and $a$ occurs in $out$, and
$e_2 = (in_2, out_2)$ and $a$ occurs in $in_2$. Since $S$ is
well-formed, this means that $a$ occurs in $in_2$ under a $*$, and
this implies that the dangling edge can be connected to $n_2$.

The second sub-case is that the dangling edge is an incoming edge
of $n$. This case can be proved as above.

Concerning the case ii) we can distinguish the following two 
sub-cases. 

The first one deals with one or more pending a-edges that are 
outgoing from a node of G or from n. Recall that a is in e but in no 
other type of <S(_e • 

Now, we observe that these edges originate from nodes of
different types of $\mathcal{S}'$, thanks to (***) and to the fact
that $n$ has exactly one edge for each symbol of $e$; recall that $a$
has been reactivated. Thanks to (**) we have that there must be a type
$e' = (in', out')$ in $\mathcal{S}'$ with $a \in sym(in')$. In the
case that only one pending $a$-edge exists, the case is proved since
the edge can be connected to the node having type $e'$ (observe that
it may be the case that a pending incoming $a$-edge has been added for
this node; in this case the two edges are simply merged). In the
remaining case we have more than one pending $a$-edge; in this case,
thanks to (**), (***) and to the well-formedness of $\mathcal{S}$, we
have that there must be a type $e' = (in', out')$ in $\mathcal{S}'$
with $a \in sym(in')$ such that $a$ is under a *. So the node having
type $e'$ is able to receive all of the pending outgoing $a$-edges.

The second ii) sub-case deals with one or more pending incoming
$a$-edges targeting either a node of $G$ or $n$. This case is similar
to the previous one. ■

3. RPQs, NREs and GXPath 

RPQs, NREs, and GXPath are graph query languages based on the
idea of using regular expressions to specify patterns that must be
matched by paths in the input graph. Given a query $q$, the result of
its evaluation over a graph $G$ is always a set of node pairs
$(v, v')$ such that $v$ and $v'$ are connected by a path $p$ in $G$
matching the query $q$.

These languages mainly differ in the class of supported regular
expressions, ranging from standard regular expressions to expressions
with counters and nested predicates.


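Regular Path Queries (RPQs) are the most basic language we
are analyzing here. Given a finite alphabet $\Sigma$, an RPQ $r$ over
$\Sigma$ is defined by the following grammar:

\[ r ::= \epsilon \mid a \mid r + r \mid r \cdot r \mid r^{*} \]

Given a graph $G = (V, E, \rho)$, the semantics of RPQs can be
defined as follows.

\[
\begin{aligned}
[\![\epsilon]\!]_G &= \{(u, u) \mid u \in V\} \\
[\![a]\!]_G &= \{(u, v) \mid (u, a, v) \in E\} \\
[\![r_1 + r_2]\!]_G &= [\![r_1]\!]_G \cup [\![r_2]\!]_G \\
[\![r_1 \cdot r_2]\!]_G &= [\![r_1]\!]_G \circ [\![r_2]\!]_G \\
[\![r^{*}]\!]_G &= \bigcup_{i \geq 0} [\![r]\!]_G^{\,i}
\end{aligned}
\]

where $\circ$ is the symbol for the concatenation of binary relations
and $R^{i}$ denotes the concatenation of $R$ with itself $i$ times.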
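The semantics above can be turned into a small evaluator. The
following is a minimal sketch, assuming a hypothetical encoding of
RPQs as nested tuples and of graphs as sets of (source, label, target)
triples; neither the encoding nor the toy graph comes from the paper.

```python
# RPQ syntax trees as nested tuples (hypothetical encoding):
#   ("eps",), ("lab", a), ("alt", r1, r2), ("cat", r1, r2), ("star", r)

def compose(r1, r2):
    """Composition of binary relations: u r1 w and w r2 v gives (u, v)."""
    return {(u, v) for (u, w) in r1 for (x, v) in r2 if w == x}

def evaluate(r, nodes, edges):
    """Return the set of node pairs denoted by r over the graph."""
    tag = r[0]
    if tag == "eps":
        return {(u, u) for u in nodes}
    if tag == "lab":
        return {(u, v) for (u, a, v) in edges if a == r[1]}
    if tag == "alt":
        return evaluate(r[1], nodes, edges) | evaluate(r[2], nodes, edges)
    if tag == "cat":
        return compose(evaluate(r[1], nodes, edges),
                       evaluate(r[2], nodes, edges))
    if tag == "star":
        step = evaluate(r[1], nodes, edges)
        result = {(u, u) for u in nodes}   # zero repetitions
        frontier = result
        while frontier:                    # add longer paths until stable
            frontier = compose(frontier, step) - result
            result |= frontier
        return result
    raise ValueError(f"unknown RPQ node: {tag}")

# Toy graph (an assumption for illustration only).
nodes = {"paper", "proc", "conf"}
edges = {("paper", "partOf", "proc"), ("proc", "series", "conf")}
query = ("cat", ("lab", "partOf"), ("lab", "series"))
print(evaluate(query, nodes, edges))  # {('paper', 'conf')}
```

The star case computes the union of all powers as a fixpoint, which
terminates because the set of node pairs is finite.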


Example 3.1 Consider the schema of Example 2.7 and the following
query:

$\mathit{partOf} \cdot \mathit{series}$

This query selects all the conference papers and relates them
to the corresponding conference series. In the case of the graph of
Example 2.2 the result is the following:¹

$\{(\mathit{HopcroftU67a}, \mathit{focs})\}$


As can be seen from the example, RPQs can express neither
branching nor backward navigation, which are introduced by NREs.

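Nested Regular Expressions (NREs) are an evolution of RPQs
and form the basis of the path language of SPARQL (T 3 |. NREs
introduce the ability of traversing edges backwards, as in 2RPQs
j@], as well as the ability of specifying conditions inside paths.

NREs obey the following grammar:

\[ n ::= \epsilon \mid a \mid a^{-} \mid n + n \mid n \cdot n \mid n^{*} \mid [n] \]

where $a^{-}$ denotes a backward navigation and $[n]$ allows one to
express conditions inside a path expression. Given a graph
$G = (V, E, \rho)$, the semantics of NREs can be defined as follows.

\[
\begin{aligned}
[\![\epsilon]\!]_G &= \{(u, u) \mid u \in V\} \\
[\![a]\!]_G &= \{(u, v) \mid (u, a, v) \in E\} \\
[\![a^{-}]\!]_G &= \{(u, v) \mid (v, a, u) \in E\} \\
[\![n_1 + n_2]\!]_G &= [\![n_1]\!]_G \cup [\![n_2]\!]_G \\
[\![n_1 \cdot n_2]\!]_G &= [\![n_1]\!]_G \circ [\![n_2]\!]_G \\
[\![n^{*}]\!]_G &= \bigcup_{i \geq 0} [\![n]\!]_G^{\,i} \\
[\![[n]]\!]_G &= \{(u, u) \mid \exists v \in V.\ (u, v) \in [\![n]\!]_G\}
\end{aligned}
\]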


Example 3.2 Consider again the graph of Example 2.2 and the
schema of Example 2.7. The following query returns all pairs $(x, y)$
where $x$ is the author of a paper in a conference series $y$, but
also published a paper in a journal $z$:

$[\mathit{creator}^{-} \cdot \mathit{journal}] \cdot \mathit{creator}^{-} \cdot \mathit{partOf} \cdot \mathit{series}$

The result of this query is $\{(\mathit{JohnE.Hopcroft}, \mathit{focs})\}$.
Observe that this query cannot be expressed through RPQs or 2RPQs.


¹ We use here the value of a node to denote the node itself.




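GXPath is the most powerful language we are examining here
and has been recently proposed by Libkin et al. in jj^]. GXPath is
essentially an adaptation of XPath to data graphs. With respect to the
previous languages, GXPath introduces the complement operator, data
tests on the values stored into nodes, as well as counters, which
generalize the Kleene star.

Among the various fragments of GXPath, we focus here on the
navigational, path-positive fragment with intersection, described by
the following grammar.

\[ \alpha ::= \epsilon \mid \_ \mid a \mid a^{-} \mid \alpha + \alpha \mid \alpha \cdot \alpha \mid \alpha^{m,n} \mid \alpha \cap \alpha \mid [\alpha] \]

Given a graph $G = (V, E, \rho)$, the semantics of GXPath can be
defined as follows.

\[
\begin{aligned}
[\![\epsilon]\!]_G &= \{(u, u) \mid u \in V\} \\
[\![\_]\!]_G &= \{(u, v) \mid \exists a \in \Sigma.\ (u, a, v) \in E\} \\
[\![a]\!]_G &= \{(u, v) \mid (u, a, v) \in E\} \\
[\![a^{-}]\!]_G &= \{(u, v) \mid (v, a, u) \in E\} \\
[\![\alpha_1 + \alpha_2]\!]_G &= [\![\alpha_1]\!]_G \cup [\![\alpha_2]\!]_G \\
[\![\alpha_1 \cdot \alpha_2]\!]_G &= [\![\alpha_1]\!]_G \circ [\![\alpha_2]\!]_G \\
[\![\alpha^{m,n}]\!]_G &= \bigcup_{i=m}^{n} [\![\alpha]\!]_G^{\,i} \\
[\![\alpha_1 \cap \alpha_2]\!]_G &= [\![\alpha_1]\!]_G \cap [\![\alpha_2]\!]_G \\
[\![[\alpha]]\!]_G &= \{(u, u) \mid \exists v \in V.\ (u, v) \in [\![\alpha]\!]_G\}
\end{aligned}
\]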


Example 3.3 Consider the graph depicted in Figure 2.

Consider now the following query:
$[\_ \cdot (\_^{*}) \cap \epsilon] \cdot (b + c)$. This query selects
all pairs of nodes $(x, y)$ where $x$ is part of a cycle and $y$ is
reachable from $x$ through an edge labelled with $b$ or $c$. ■
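The query of Example 3.3 can be evaluated directly with relation
operations. The sketch below assumes a small hypothetical graph (the
graph of the figure is not reproduced here), and the helper names
`star`, `nest`, and `label` are illustrative only.

```python
def compose(r1, r2):
    """Composition of binary relations."""
    return {(u, v) for (u, w) in r1 for (x, v) in r2 if w == x}

def star(rel, nodes):
    """Reflexive and transitive closure of rel over nodes."""
    result = {(u, u) for u in nodes}
    frontier = result
    while frontier:
        frontier = compose(frontier, rel) - result
        result |= frontier
    return result

def nest(rel):
    """[alpha]: keep (u, u) for every u that has some alpha-successor."""
    return {(u, u) for (u, _) in rel}

# Hypothetical graph: x1 and x2 lie on a cycle, x3 does not.
nodes = {"x1", "x2", "x3", "y"}
edges = {("x1", "a", "x2"), ("x2", "a", "x1"),
         ("x1", "b", "y"), ("x3", "c", "y")}

def label(a):
    """Pairs connected by an edge labelled a."""
    return {(u, v) for (u, l, v) in edges if l == a}

identity = {(u, u) for u in nodes}             # semantics of epsilon
wildcard = {(u, v) for (u, _, v) in edges}     # semantics of _

# [_ . (_*) ∩ eps]: nodes that reach themselves through a non-empty path.
on_cycle = nest(compose(wildcard, star(wildcard, nodes)) & identity)
answer = compose(on_cycle, label("b") | label("c"))
print(sorted(answer))  # [('x1', 'y')]
```

Node x2 is on the cycle but has no outgoing b or c edge, and x3 has a
c edge but is not on a cycle, so only the pair (x1, y) is returned.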

In the following we will indicate with RPQ, NRE, and GXP the 
three classes of regular expressions we are studying here. 

4. Inference Rules 

In this section we present a type inference approach for typing
RPQs, NREs, and GXPath queries. The approach we propose here
is a basic yet useful one. It associates to each query a set of schema
element pairs; hence, a query $q$ is typed by a set
$\{(e_i, e'_i)\}_i$, where $e_i$ and $e'_i$ are schema elements
describing the nodes at the beginning and at the end of a path $p$
matching $q$. Another advantage of this typing approach is that it can
be performed in polynomial time (Theorem 4.4) and that it is sound and
complete for RPQs (Theorems 4.3 and 4.13). For NRE and GXPath queries
only soundness holds (we will provide counterexamples for
completeness).

Typing rules rely on the judgement defined below. We use the
meta-variables $\mathcal{E}$ and $\mathcal{E}_i$ to denote sets of
schema element pairs.

Definition 4.1 (Basic Judgment) $\vdash_{\mathcal{S}} q : \mathcal{E}$
is a judgment stating that, given a well-formed $\mathcal{S}$ and a
graph $G \in [\![\mathcal{S}]\!]$, $\mathcal{E}$ is an upper bound for
$[\![q]\!]_G$.


Type inference rules for RPQs, NREs, and GXPath queries are
shown in Tables 4.1, 4.2, and 4.3. In these rules, $\circ$ is the
operator for the usual composition of binary relations, and
$\mathit{first}(\mathcal{E}) = \{e_i \mid \exists e_j.\ (e_i, e_j) \in \mathcal{E}\}$.
In Table 4.1, rule (TypeEpsilon) types $\epsilon$ queries, rule
(TypeLabel) deals with forward navigation, while rules (TypeUnion)
and (TypeConc) type queries with union and concatenation,
respectively. Rule (TypeStar), finally, deals with $r^{*}$ queries.

In Table 4.2, rules (TypeBackLabel) and (TypeCond) infer
a type for queries with backward navigation and nested conditions,
respectively.

In Table 4.3, finally, rules (TypeAnyLabel), (TypeCount),
and (TypeIntersect) deal with, respectively, wildcard queries,
counting, and intersection.

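Table 4.1. Basic inference rules for RPQs.

(TypeEpsilon)
\[ \frac{e_i \in \mathcal{S}}{\vdash_{\mathcal{S}} \epsilon : \{(e_i, e_i)\}_i} \]

(TypeLabel)
\[ \frac{e_i \in \mathcal{S} \quad e'_i \in \mathcal{S} \quad a \in \mathit{sym}(e_i.out) \cap \mathit{sym}(e'_i.in)}{\vdash_{\mathcal{S}} a : \{(e_i, e'_i)\}_i} \]

(TypeUnion)
\[ \frac{\vdash_{\mathcal{S}} r_1 : \mathcal{E}_1 \quad \vdash_{\mathcal{S}} r_2 : \mathcal{E}_2}{\vdash_{\mathcal{S}} r_1 + r_2 : \mathcal{E}_1 \cup \mathcal{E}_2} \]

(TypeConc)
\[ \frac{\vdash_{\mathcal{S}} r_1 : \mathcal{E}_1 \quad \vdash_{\mathcal{S}} r_2 : \mathcal{E}_2}{\vdash_{\mathcal{S}} r_1 \cdot r_2 : \mathcal{E}_1 \circ \mathcal{E}_2} \]

(TypeStar)
\[ \frac{\vdash_{\mathcal{S}} r : \mathcal{E}}{\vdash_{\mathcal{S}} r^{*} : \bigcup_{i \geq 0} \mathcal{E}^{i}} \]


Table 4.2. Additional basic inference rules for NREs.

(TypeBackLabel)
\[ \frac{e_i \in \mathcal{S} \quad e'_i \in \mathcal{S} \quad a \in \mathit{sym}(e_i.in) \cap \mathit{sym}(e'_i.out)}{\vdash_{\mathcal{S}} a^{-} : \{(e_i, e'_i)\}_i} \]

(TypeCond)
\[ \frac{\vdash_{\mathcal{S}} n : \mathcal{E}_1}{\vdash_{\mathcal{S}} [n] : \mathit{first}(\mathcal{E}_1) \times \mathit{first}(\mathcal{E}_1)} \]


Table 4.3. Additional basic inference rules for GXPath queries.

(TypeAnyLabel)
\[ \frac{e_i \in \mathcal{S} \quad e'_i \in \mathcal{S} \quad \mathit{sym}(e_i.out) \cap \mathit{sym}(e'_i.in) \neq \emptyset}{\vdash_{\mathcal{S}} \_ : \{(e_i, e'_i)\}_i} \]

(TypeCount)
\[ \frac{\vdash_{\mathcal{S}} \alpha : \mathcal{E}}{\vdash_{\mathcal{S}} \alpha^{m,n} : \bigcup_{i=m}^{n} \mathcal{E}^{i}} \]

(TypeIntersect)
\[ \frac{\vdash_{\mathcal{S}} \alpha_1 : \mathcal{E}_1 \quad \vdash_{\mathcal{S}} \alpha_2 : \mathcal{E}_2}{\vdash_{\mathcal{S}} \alpha_1 \cap \alpha_2 : \mathcal{E}_1 \cap \mathcal{E}_2} \]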


Example 4.2 Consider again the query of Example 3.2. To type
this query, rules (TypeConc) and (TypeCond) are first invoked;
rule (TypeCond), in turn, invokes rules (TypeConc), (TypeLabel),
and (TypeBackLabel) to type $\mathit{creator}^{-} \cdot \mathit{journal}$.
Rule (TypeBackLabel) returns the set $\{(e_5, e_1)\}$, while rule
(TypeLabel) returns $\{(e_1, e_2)\}$. Rule (TypeConc), hence,
returns $\{(e_5, e_2)\}$, while rule (TypeCond) returns
$\{(e_5, e_5)\}$.

Rules (TypeConc), (TypeLabel), and (TypeBackLabel)
are called again to type
$\mathit{creator}^{-} \cdot \mathit{partOf} \cdot \mathit{series}$, returning
$\{(e_5, e_1)\} \circ \{(e_1, e_3)\} \circ \{(e_3, e_4)\} = \{(e_5, e_4)\}$.

Therefore, the result of the type inference is
$\{(e_5, e_5)\} \circ \{(e_5, e_4)\} = \{(e_5, e_4)\}$, as expected. ■

The soundness of basic type inference is stated by the following 
theorem. 

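Theorem 4.3 Given a well-formed $\mathcal{S}$ and a query $q$ on a
graph $G = (V, E, \rho) \in [\![\mathcal{S}]\!]$, if
$\vdash_{\mathcal{S}} q : \mathcal{E}$, then, for each
$(u, v) \in [\![q]\!]_G$ there exists $(e_i, e_j) \in \mathcal{E}$ such
that $u \in [\![e_i]\!]$ and $v \in [\![e_j]\!]$.

Proof. By structural induction on the queries.

($q = \epsilon$) Trivial.

($q = a$) Let $(u, v) \in [\![a]\!]_G$. By definition of query
semantics, there exists $(u, a, v) \in E$. By definition of
well-formed schemas, there exist $e_i, e_j \in \mathcal{S}$ such that
$u \in [\![e_i]\!]$, $v \in [\![e_j]\!]$, and
$a \in \mathit{sym}(e_i.out) \cap \mathit{sym}(e_j.in)$. By rule
(TypeLabel), $(e_i, e_j) \in \mathcal{E}$.

($q = a^{-}$) Let $(u, v) \in [\![a^{-}]\!]_G$. By definition of query
semantics, there exists $(v, a, u) \in E$. By definition of
well-formed schemas, there exist $e_i, e_j \in \mathcal{S}$ such that
$u \in [\![e_i]\!]$, $v \in [\![e_j]\!]$, and
$a \in \mathit{sym}(e_i.in) \cap \mathit{sym}(e_j.out)$. By rule
(TypeBackLabel), $(e_i, e_j) \in \mathcal{E}$.

($q = \_$) Let $(u, v) \in [\![\_]\!]_G$. By definition of query
semantics, there exists $a \in \Sigma$ such that $(u, a, v) \in E$. By
definition of well-formed schemas, there exist
$e_i, e_j \in \mathcal{S}$ such that $u \in [\![e_i]\!]$,
$v \in [\![e_j]\!]$, and
$a \in \mathit{sym}(e_i.out) \cap \mathit{sym}(e_j.in)$. By rule
(TypeAnyLabel), $(e_i, e_j) \in \mathcal{E}$.

($q = r_1 + r_2$) Let $(u, v) \in [\![r_1 + r_2]\!]_G$. By definition
of query semantics, $(u, v) \in [\![r_1]\!]_G$ or
$(u, v) \in [\![r_2]\!]_G$. Wlog we can assume that
$(u, v) \in [\![r_1]\!]_G$ (the case for $(u, v) \in [\![r_2]\!]_G$ is
symmetrical). By induction, there exist $e_i, e_j \in \mathcal{S}$
such that $u \in [\![e_i]\!]$, $v \in [\![e_j]\!]$, and
$(e_i, e_j) \in \mathcal{E}_1$, where
$\vdash_{\mathcal{S}} r_1 : \mathcal{E}_1$. By rule (TypeUnion),
$(e_i, e_j) \in \mathcal{E}$.

($q = r_1 \cdot r_2$) Let $(u, v) \in [\![r_1 \cdot r_2]\!]_G$. By
definition of query semantics, there exists $w \in V$ such that
$(u, w) \in [\![r_1]\!]_G$ and $(w, v) \in [\![r_2]\!]_G$. Let
$\vdash_{\mathcal{S}} r_1 : \mathcal{E}_1$ and
$\vdash_{\mathcal{S}} r_2 : \mathcal{E}_2$. By induction and by
uniqueness of vertex typing, there exist
$e_i, e_j, e_k \in \mathcal{S}$ such that $u \in [\![e_i]\!]$,
$w \in [\![e_j]\!]$, $v \in [\![e_k]\!]$,
$(e_i, e_j) \in \mathcal{E}_1$, and $(e_j, e_k) \in \mathcal{E}_2$. By
rule (TypeConc), $(e_i, e_j) \circ (e_j, e_k) = (e_i, e_k) \in \mathcal{E}$.

($q = r^{*}$) The thesis follows from the soundness of typing of union
and concatenation.

($q = [n]$) Let $(u, u) \in [\![[n]]\!]_G$. By definition of query
semantics, there exists $v \in V$ such that $(u, v) \in [\![n]\!]_G$.
Let $\vdash_{\mathcal{S}} n : \mathcal{E}_1$. By induction, there
exist $e_i, e_j \in \mathcal{S}$ such that $u \in [\![e_i]\!]$,
$v \in [\![e_j]\!]$, and $(e_i, e_j) \in \mathcal{E}_1$. Hence
$e_i \in \mathit{first}(\mathcal{E}_1)$ and, by rule (TypeCond),
$(e_i, e_i) \in \mathcal{E}$.

($q = \alpha^{m,n}$) The thesis follows from the soundness of typing
for union and concatenation.

($q = \alpha_1 \cap \alpha_2$) Let
$(u, v) \in [\![\alpha_1 \cap \alpha_2]\!]_G$. By definition of query
semantics, $(u, v) \in [\![\alpha_1]\!]_G$ and
$(u, v) \in [\![\alpha_2]\!]_G$. Let
$\vdash_{\mathcal{S}} \alpha_1 : \mathcal{E}_1$ and
$\vdash_{\mathcal{S}} \alpha_2 : \mathcal{E}_2$. By induction and by
uniqueness of node typing, there exist $e_i, e_j \in \mathcal{S}$ such
that $u \in [\![e_i]\!]$, $v \in [\![e_j]\!]$,
$(e_i, e_j) \in \mathcal{E}_1$, and $(e_i, e_j) \in \mathcal{E}_2$. By
rule (TypeIntersect), $(e_i, e_j) \in \mathcal{E}$.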


The basic type inference approach returns quite simple informa¬ 
tion. This fact is counterbalanced by its polynomial complexity, as 
stated by the following theorem. 

Theorem 4.4 (Complexity of $\vdash_{\mathcal{S}} q : \mathcal{E}$)
$\vdash_{\mathcal{S}} q : \mathcal{E}$ can be evaluated in polynomial
time.

Proof sketch. To prove the thesis we must first observe that, given
a query $q$ of length $|q|$, each rule consumes at least one node in
the parsing tree of $q$. This implies that $q$ will be typed by a
number of rule invocations polynomial in $|q|$.

To complete the proof, it suffices to prove that each rule can be
evaluated in polynomial time. This proof can be done by induction
on the queries.

The only non-trivial cases are those concerning rules (TypeStar)
and (TypeCount). To evaluate these rules in polynomial time it
suffices to recognize that a set $\mathcal{E}$ can be interpreted as
the set of edges in a schema element graph. Evaluating these rules,
hence, is equivalent to the computation of the reflexive and
transitive closure of the graph (bounded, in the case of rule
(TypeCount)). The unbounded closure can be computed in polynomial time
by exploiting Warshall's algorithm, while the bounded closure can be
computed in polynomial time by relying on the usual squaring method. ■
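The two closure computations mentioned in the proof sketch can be
illustrated as follows. This is a minimal sketch under stated
assumptions: sets of type pairs are Python sets of tuples, the function
names are hypothetical, and the bounded closure uses one possible
reading of the squaring method, namely the identity
E^m ∪ ... ∪ E^n = E^m ∘ (I ∪ E)^(n-m), where I is the identity
relation on types.

```python
def compose(e1, e2):
    """Composition of two sets of type pairs."""
    return {(a, c) for (a, b) in e1 for (x, c) in e2 if b == x}

def reflexive_transitive_closure(pairs, types):
    """Warshall-style closure; cubic in the number of schema types."""
    reach = set(pairs) | {(t, t) for t in types}
    for k in types:
        for i in types:
            if (i, k) in reach:
                for j in types:
                    if (k, j) in reach:
                        reach.add((i, j))
    return reach

def power(rel, k, types):
    """rel composed with itself k times, by repeated squaring."""
    result = {(t, t) for t in types}   # zero-th power is the identity
    base = set(rel)
    while k > 0:
        if k & 1:
            result = compose(result, base)
        base = compose(base, base)
        k >>= 1
    return result

def bounded_closure(pairs, m, n, types):
    """Union of the i-th powers of pairs for m <= i <= n."""
    identity = {(t, t) for t in types}
    return compose(power(pairs, m, types),
                   power(set(pairs) | identity, n - m, types))

# Hypothetical chain of four types e1 -> e2 -> e3 -> e4.
types = ["e1", "e2", "e3", "e4"]
pairs = {("e1", "e2"), ("e2", "e3"), ("e3", "e4")}
print(sorted(bounded_closure(pairs, 2, 3, types)))
# [('e1', 'e3'), ('e1', 'e4'), ('e2', 'e4')]
```

Each call to `power` performs a number of compositions logarithmic in
the exponent, so the cost stays polynomial in the number of types and
in the size of the counter bounds written in binary.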

Basic type inference for RPQs is not only sound, but also complete.
The proof of completeness relies on a number of definitions and
properties.

The first definition specifies the set of paths matching a query. 

Definition 4.5 Given an RPQ $q$ on graphs over a finite alphabet
$\Sigma$, the set of paths that can match $q$ is recursively defined
as follows.

\[
\begin{aligned}
\mathit{Paths}(\epsilon) &= \{\epsilon\} \\
\mathit{Paths}(a) &= \{a\} \\
\mathit{Paths}(q_1 + q_2) &= \mathit{Paths}(q_1) \cup \mathit{Paths}(q_2) \\
\mathit{Paths}(q_1 \cdot q_2) &= \mathit{Paths}(q_1) \times \mathit{Paths}(q_2) \\
\mathit{Paths}(q^{*}) &= \bigcup_{i \geq 0} \mathit{Paths}(q)^{i}
\end{aligned}
\]

The second definition specifies when two nodes $u$ and $v$ are
connected by a path $p$ in a graph $G$. A path can be either
$\epsilon$ (the empty path) or a path $a \cdot p'$, where $a$ is an
edge label, and $p'$ is a path in turn. Path concatenation
$p_1 \cdot p_2$ is defined in the obvious way.

Definition 4.6 Given a graph $G = (V, E, \rho)$, we say that two
nodes $u, v \in V$ are connected by a path $p$ if either
$p = \epsilon$ and $u = v$, or $p = a \cdot p'$ and there exists a
node $u' \in V$ such that $(u, a, u') \in E$ and $p'$ connects nodes
$u', v \in V$.
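Definition 4.6 translates into a short membership test. The sketch
below assumes that a path is given as a Python list of edge labels and
a graph as a set of (source, label, target) triples; the function name
`connects` is hypothetical.

```python
def connects(edges, u, v, path):
    """True if the label sequence `path` leads from node u to node v."""
    current = {u}                      # nodes reachable with the prefix read so far
    for label in path:
        current = {t for (s, a, t) in edges if s in current and a == label}
    return v in current                # empty path: true only when u == v

# Hypothetical two-edge graph.
edges = {("u", "a", "w"), ("w", "b", "v")}
print(connects(edges, "u", "v", ["a", "b"]))  # True
print(connects(edges, "u", "v", ["b", "a"]))  # False
```

Tracking a set of current nodes, rather than a single node, accounts
for graphs in which a node has several outgoing edges with the same
label.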

It is easy to prove that, if $p_1$ connects $u, v$ and $p_2$ connects
$v, t$ in $G$, then $p_1 \cdot p_2$ connects $u, t$ in $G$. The
following lemma relates RPQ semantics, path semantics, and graph
paths.

Lemma 4.7 Given an RPQ $q$ and a graph $G = (V, E, \rho)$, if
$u, v \in V$ are connected by a path $p$ in $G$, and
$p \in \mathit{Paths}(q)$, then $(u, v) \in [\![q]\!]_G$.

Proof. By structural induction on $q$.

($q = \epsilon$) Trivial.

($q = a$) If $p \in \mathit{Paths}(q)$, it must be that $p = a$. By
definition of the semantics of RPQs, it follows that each pair of
nodes $u, v \in V$ that is connected by $p = a$ is in $[\![q]\!]_G$.

($q = q_1 + q_2$) Simple induction.

($q = q_1 \cdot q_2$) Given that
$p \in \mathit{Paths}(q_1 \cdot q_2)$, we have that
$p = p_1 \cdot p_2$ with $p_i \in \mathit{Paths}(q_i)$, $i = 1, 2$.
Also, since $p$ connects $(u, v)$, $p_1$ connects $(u, u_1)$ and $p_2$
connects $(u_1, v)$ for a node $u_1$ in $G$. By induction we have
$(u, u_1) \in [\![q_1]\!]_G$ and $(u_1, v) \in [\![q_2]\!]_G$, so the
thesis follows by definition of $[\![q]\!]_G$.

($q = q_1^{*}$) Given that $p \in \mathit{Paths}(q_1^{*})$, we have
that $p = p_1 \cdot \ldots \cdot p_n$ with
$p_i \in \mathit{Paths}(q_1)$ and $i = 1, \ldots, n$. Since $p$
connects $(u, v)$, we have that
$(u, u_1), (u_1, u_2), \ldots, (u_{n-1}, u_n)$, with $u_n = v$, are
respectively connected by $p_1, p_2, \ldots, p_n$ in $G$. So by
induction each of these pairs is in $[\![q_1]\!]_G$ and the thesis
follows by definition of $[\![q]\!]_G$.


We need now to define paths over schemas. 

Definition 4.8 Given a schema $\mathcal{S} = \{e_i\}_{i=0}^{n}$, we
say that two types $e_0, e_t \in \mathcal{S}$ are connected by a path
$p$ if either $p = \epsilon$ and $e_0 = e_t$, or $p = a \cdot p'$ and
there exists $e \in \mathcal{S}$ such that:

• $a \in \mathit{sym}(e_0.out) \cap \mathit{sym}(e.in)$, and

• $e, e_t$ are connected by $p'$.

The following lemma relates RPQ typing and paths over schemas.

Lemma 4.9 Given a schema $\mathcal{S}$ and a query $q \in RPQ$ whose
inputs are described by $\mathcal{S}$, if
$\vdash_{\mathcal{S}} q : \mathcal{E}$ then, for each
$(e_i, e_j) \in \mathcal{E}$ there exists a path $p$ over
$\mathcal{S}$ such that $p$ connects $(e_i, e_j)$ and
$p \in \mathit{Paths}(q)$.

Proof. By structural induction on $q$. Cases $q = \epsilon$, $q = a$,
and $q = q_1 + q_2$ are trivial.

($q = q_1 \cdot q_2$) We have that
$\vdash_{\mathcal{S}} q_i : \mathcal{E}_i$, with $i = 1, 2$, and
$\mathcal{E} = \mathcal{E}_1 \circ \mathcal{E}_2$. By induction we
have that, for each pair $(e_1, e_3) \in \mathcal{E}_1$ and
$(e_3, e_2) \in \mathcal{E}_2$, there exist $p_1$ connecting
$(e_1, e_3)$ and $p_2$ connecting $(e_3, e_2)$ on $\mathcal{S}$, with
$p_i \in \mathit{Paths}(q_i)$. So the thesis follows by taking
$p = p_1 \cdot p_2$.

($q = q_1^{*}$) The case is similar to the above once observed that
$(e_i, e_j) \in \mathcal{E}$ means that there exist
$(e_i, e'_1), (e'_1, e'_2), \ldots, (e'_{n-1}, e_j)$ in the type
inferred for $q_1$ and that each of these couples is connected by a
path $p_i$ on $\mathcal{S}$ such that $p_i \in \mathit{Paths}(q_1)$,
with $i = 1, \ldots, n$.


To prove completeness of our RPQ typing rules we also need
the following definition and a couple of lemmas.

Definition 4.10 Given a schema $\mathcal{S}$ and its double
normalisation $\mathit{DNorm}(\mathcal{S})$, we indicate with
$F_{\mathcal{S}}$ the function from $\mathit{DNorm}(\mathcal{S})$ to
$\mathcal{S}$ associating to each type $e$ in
$\mathit{DNorm}(\mathcal{S})$ the unique type $e'$ in $\mathcal{S}$
from which $e$ has been generated (Definition 2.19).

Also, for every $e'$ in $\mathcal{S}$, $F_{\mathcal{S}}^{-1}(e')$ is
the set of element types generated by $e'$ by means of double
normalisation.

Lemma 4.11 Given a schema $\mathcal{S}$ with the pair of element types
$(e_1, e_2)$, if this pair is connected by $p$ on $\mathcal{S}$, then
there exist $e'_i \in F_{\mathcal{S}}^{-1}(e_i)$ for $i = 1, 2$ such
that $(e'_1, e'_2)$ are connected by $p$ in
$\mathit{DNorm}(\mathcal{S})$.

Proof. By induction on $|p|$. If $p = a$, then there exist two types
$e_1$ and $e_2$ in $\mathcal{S}$ such that
$a \in \mathit{sym}(e_1.out) \cap \mathit{sym}(e_2.in)$. We have
$\mathit{norm}(\mathit{sym}(e_1.out)) = G_1 + \ldots + G_n$ and
$a \in \mathit{sym}(G_i)$ for some $i \in \{1, \ldots, n\}$.
Similarly, $\mathit{norm}(\mathit{sym}(e_2.in)) = G'_1 + \ldots + G'_m$
and $a \in \mathit{sym}(G'_j)$ for some $j \in \{1, \ldots, m\}$. By
definition of double normalisation we know that there will be $e'_1$
and $e'_2$ obtained by normalisation of $e_1$ and $e_2$ such that
$e'_1.out = G_i$ and $e'_2.in = G'_j$, which proves the base case.


Concerning the case $p = a \cdot p'$ with $p' = b \cdot p''$ (if $p'$
is empty the case has already been proved), by hypothesis we have that
$(e_1, e_3)$ is connected by $a$ and that $(e_3, e_2)$ is connected by
$p'$. By induction we can assume that there exists in
$\mathit{DNorm}(\mathcal{S})$ a couple $(e'_1, e'_3)$ connected by $a$
and a couple $(e''_3, e'_2)$ connected by $p'$, with $e'_3$ and
$e''_3$ in $F_{\mathcal{S}}^{-1}(e_3)$. It is easy to prove that the
type $e'''_3 = (e'_3.in, e''_3.out)$ is in $F_{\mathcal{S}}^{-1}(e_3)$
and that it both receives $a$ and emits $b$. So we have that $a$
connects $(e'_1, e'''_3)$ and $p'$ connects $(e'''_3, e'_2)$ in
$\mathit{DNorm}(\mathcal{S})$, so the thesis is proved.

■ 

Lemma 4.12 For any well-formed schema $\mathcal{S}$ there exists a
graph $G$ that respects $\mathcal{S}$ and such that: if $(e_1, e_2)$
are connected by $p$ in $\mathcal{S}$, then two nodes $(n_1, n_2)$ in
$G$ are connected by $p$, with $n_i$ having type $e_i$, for
$i = 1, 2$.

Proof. For any $(e_1, e_2)$ in $\mathcal{S}$, by Lemma 4.11 we have
that there exist $e'_i \in F_{\mathcal{S}}^{-1}(e_i)$ for $i = 1, 2$
such that $(e'_1, e'_2)$ are connected by $p$ in
$\mathit{DNorm}(\mathcal{S})$. We then prove that the desired $G$
exists for $\mathit{DNorm}(\mathcal{S})$ with nodes $(n_1, n_2)$
having types $(e'_1, e'_2)$, and therefore $(e_1, e_2)$.

The proof of the existence of such a $G$ is quite similar to
that for Theorem 4.13, so we omit the details, and just observe
that the graph $G$ we are looking for is actually the graph built
in that proof, which ensures that each node of the built graph
corresponds to a different type in $\mathit{DNorm}(\mathcal{S})$, and
that each node of a type $e'$ in $\mathit{DNorm}(\mathcal{S})$ has an
incoming (outgoing) $a$-edge for each $a \in \mathit{sym}(e'.in)$ (for
each $a \in \mathit{sym}(e'.out)$). This directly implies that we can
connect the two nodes $(n_1, n_2)$ in $G$ corresponding to types
$(e'_1, e'_2)$ with the path $p$. ■

We have now all the tools to prove completeness of RPQ typing 
over well-formed schemas. 

Theorem 4.13 Given a well-formed schema $\mathcal{S}$ and a query
$q \in RPQ$ whose inputs are described by $\mathcal{S}$, if
$\vdash_{\mathcal{S}} q : \mathcal{E}$, then, for each
$(e_i, e_j) \in \mathcal{E}$, there exists
$G \in [\![\mathcal{S}]\!]$ such that $[\![q]\!]_G$ contains $(u, v)$,
where $u \in [\![e_i]\!]$ and $v \in [\![e_j]\!]$.

Proof. By Lemma 4.9 we have that $(e_i, e_j)$ are connected by a
path $p$ in $\mathcal{S}$ and that $p \in \mathit{Paths}(q)$. By
Lemma 4.12 we have that there exists a graph $G$ meeting
$\mathcal{S}$ and including two nodes $(u, v)$ connected by $p$. So by
Lemma 4.7 we can conclude that $(u, v) \in [\![q]\!]_G$ and
$u \in [\![e_i]\!]$ and $v \in [\![e_j]\!]$.

■ 

Theorem 4.13 has important consequences. Indeed, as shown
by Corollary 4.15, we can prove that, for RPQs, the following
satisfiability problem can be decided in polynomial time.

Problem 4.14 (SAT(x)) Given a well-formed $\mathcal{S}$ and a query
$q$ in a language $x$, is there a graph $G \in [\![\mathcal{S}]\!]$
such that $[\![q]\!]_G \neq \emptyset$?

Corollary 4.15 SAT(RPQ) can be decided in polynomial time.

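Proof. Consider a query $q$ and a well-formed $\mathcal{S}$. We must
first prove that:

\[ (\forall G \in [\![\mathcal{S}]\!] : [\![q]\!]_G = \emptyset) \iff\ \vdash_{\mathcal{S}} q : \emptyset \]

($\Rightarrow$) Assume that there is no graph
$G \in [\![\mathcal{S}]\!]$ such that $[\![q]\!]_G \neq \emptyset$.
Then, by Theorem 4.13, $\vdash_{\mathcal{S}} q : \emptyset$. Indeed,
if $\vdash_{\mathcal{S}} q : \mathcal{E}$ with
$\mathcal{E} \neq \emptyset$, by Theorem 4.13 there would exist at
least one graph $G \in [\![\mathcal{S}]\!]$ for which
$[\![q]\!]_G \neq \emptyset$, which is a contradiction.

($\Leftarrow$) Assume that $\vdash_{\mathcal{S}} q : \emptyset$.
Then, by Theorem 4.3, there is no graph $G \in [\![\mathcal{S}]\!]$
such that $[\![q]\!]_G \neq \emptyset$. Indeed, if
$[\![q]\!]_G \neq \emptyset$, then, by Theorem 4.3,
$\vdash_{\mathcal{S}} q : \mathcal{E}$, where
$\mathcal{E} \neq \emptyset$, which is a contradiction.

The fact that $\vdash_{\mathcal{S}} q : \mathcal{E}$ can be evaluated
in polynomial time (by Theorem 4.4) completes the proof. ■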
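The decision procedure behind Corollary 4.15 amounts to running the
rules of Table 4.1 and testing the inferred set for emptiness. The
following sketch assumes a hypothetical encoding in which each schema
type is reduced to its two symbol sets (sym of the incoming and of the
outgoing expression), RPQs are nested tuples, and the example schema
is assumed to be well-formed; well-formedness itself is not checked.

```python
def compose(e1, e2):
    """Composition of two sets of type pairs."""
    return {(a, c) for (a, b) in e1 for (x, c) in e2 if b == x}

def infer(r, schema):
    """Infer a set of type pairs for the RPQ r (rules of Table 4.1)."""
    tag = r[0]
    if tag == "eps":                                  # (TypeEpsilon)
        return {(e, e) for e in schema}
    if tag == "lab":                                  # (TypeLabel)
        a = r[1]
        return {(s, t) for s in schema for t in schema
                if a in schema[s]["out"] and a in schema[t]["in"]}
    if tag == "alt":                                  # (TypeUnion)
        return infer(r[1], schema) | infer(r[2], schema)
    if tag == "cat":                                  # (TypeConc)
        return compose(infer(r[1], schema), infer(r[2], schema))
    if tag == "star":                                 # (TypeStar)
        step = infer(r[1], schema)
        result = {(e, e) for e in schema}
        frontier = result
        while frontier:
            frontier = compose(frontier, step) - result
            result |= frontier
        return result
    raise ValueError(f"unknown RPQ node: {tag}")

def satisfiable(r, schema):
    """For RPQs over a well-formed schema: non-empty type means satisfiable."""
    return bool(infer(r, schema))

# Hypothetical schema: e1 emits a, e2 receives a and emits b, e3 receives b.
schema = {
    "e1": {"in": set(), "out": {"a"}},
    "e2": {"in": {"a"}, "out": {"b"}},
    "e3": {"in": {"b"}, "out": set()},
}
a, b = ("lab", "a"), ("lab", "b")
print(satisfiable(("cat", a, b), schema))  # True
print(satisfiable(("cat", b, a), schema))  # False
```

The equivalence between an empty inferred type and unsatisfiability
relies on both soundness and completeness, so this test is only
meaningful for RPQs.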

While sound, basic type inference is not complete for NREs and 
GXPath queries, as shown in the following example. 

Example 4.16 Consider the following graph schema:

\[
\begin{aligned}
e_1 &= (\epsilon, a + b) & e_4 &= ((c)^{*}, \epsilon) \\
e_2 &= ((a)^{*}, c) & e_5 &= ((d)^{*}, \epsilon) \\
e_3 &= ((b)^{*}, d)
\end{aligned}
\]

Consider now the query $q = [b] \cdot a \cdot c$. This query looks for
nodes having outgoing edges labelled with $a$ and $b$. As $e_1$
prescribes that these edges are mutually exclusive, the result of this
query is always empty. However, the rules first infer the set
$\{(e_1, e_1)\}$ for the nested regular expression, and the set
$\{(e_1, e_4)\}$ for $a \cdot c$. The inferred set, hence, is
$\{(e_1, e_4)\}$. ■
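The inference steps of Example 4.16 can be replayed mechanically. In
this sketch each type of the example is reduced to the symbol sets of
its incoming and outgoing expressions, which is an assumption of the
encoding: the choice a + b of e1 becomes the plain set {a, b}, and the
helper names are hypothetical.

```python
# Symbol sets (in, out) of the five types of Example 4.16.
schema = {
    "e1": (set(), {"a", "b"}),
    "e2": ({"a"}, {"c"}),
    "e3": ({"b"}, {"d"}),
    "e4": ({"c"}, set()),
    "e5": ({"d"}, set()),
}

def type_label(a):
    """(TypeLabel): pairs (s, t) where s can emit a and t can receive a."""
    return {(s, t) for s in schema for t in schema
            if a in schema[s][1] and a in schema[t][0]}

def compose(e1, e2):
    """(TypeConc): composition of two sets of type pairs."""
    return {(s, t) for (s, m) in e1 for (x, t) in e2 if m == x}

def type_cond(pairs):
    """(TypeCond): product of the first components with themselves."""
    firsts = {s for (s, _) in pairs}
    return {(s, t) for s in firsts for t in firsts}

cond = type_cond(type_label("b"))                   # type of [b]
tail = compose(type_label("a"), type_label("c"))    # type of a . c
inferred = compose(cond, tail)                      # type of [b] . a . c
print(inferred)  # {('e1', 'e4')}
```

Because only symbol sets are inspected, the mutual exclusion between a
and b in e1 is invisible to the rules, which is why a non-empty type
is inferred for a query whose result is always empty.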

5. Related Works 

Describing the structure of graphs is a subject that has been
analyzed only in a few papers. Graph grammars l2(ill probably
represent the most widely known approach for describing graphs. As a
plain string grammar, a graph grammar shows how a graph can be
generated starting from a source node, by applying a set of production
rules. As for string or tree grammars, graph grammars are
used for generating graphs, for transforming existing graphs into
new ones, or for pattern matching, but they are not suitable for type
inference.

TSL is the schema language of Trinity EH], a main-memory
graph processing system based on the Microsoft ecosystem. By
using a TSL script, which is compiled in .NET object code, it is
possible to specify the structure of nodes, which can have richly
defined values, e.g., those required by BFS and DFS algorithms,
as well as the type of outgoing edges; apparently, there is no way
to describe constraints on incoming edges, which can have any
cardinality.

SheX f 2 ^ is a schema language for RDF data. As in TSL,
in SheX it is possible to describe complex node structures, and,
unlike in TSL, outgoing edges can be defined by using regular
expressions. However, just as in TSL, there is no way to specify
constraints on incoming edges. This means that, for instance, in
a schema describing cars and car owners, one can impose the
constraint that a single person can own at most n cars, but not
the constraint that a car can have one single owner at a time. This
makes it impossible to define empty SheX schemas, but it limits the
expressivity of the language.

6. Conclusions and Future Work 

In this paper we described a schema language for data graphs,
introduced a mild restriction that makes it impossible to define empty
schemas, and used this schema formalism to create a type inference
system for RPQs, NREs, and GXPath queries. This type inference
system is sound and, in the case of RPQs, it is also complete, hence
making it possible to check the satisfiability of a query just by
looking at its inferred type.

In the near future we plan to work in three directions. First,
we want to understand whether schema well-formedness can be checked
in polynomial time or whether double normalization is mandatory.
Second, we want to better investigate the emptiness problem for a more
general class of graph schemas and try to relax the 1/* constraint,
in particular by exploring approaches for checking the consistency
of systems of linear diophantine equations with linear parameters.
Finally, we want to study type inference techniques that return more
detailed information about an input query, e.g., the set of paths that
the query may traverse in an input graph matching a given schema.
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